General Series Introduction

Orsett Technical Reports are designed to allow the 
exploration of specific topics in detail. Series A 
contains four reports on different aspects of the student 
evaluation of teaching effectiveness (SETE) or students' 
ratings of instruction (SRI) . This is the rating of 
lecturers and teachers by their students. 

REPORT No. 1 1

This report is a literature review of the studies 
into SETE and SRI, mostly from the USA. The aim is to 
outline what students see as the "ideal lecturer". Much 
of the material comes from the prolific work of Kenneth 
Feldman . 

REPORT No. 2 2 

This report addresses the issue of the accuracy of 
students' ratings of their instructors. Is it an accurate 
picture of their teaching effectiveness or the personal 
feelings of the students? The issues of reliability, 
generalisability, and validity of the ratings, along with 
rating errors, are discussed. 

REPORT No. 3 3 

Report no. 3 takes many of the technical issues
raised in report no. 2 further, in particular the
potential biases in SETE and SRI.

REPORT No. 4 4 

This report gives details of the construction of the 
Birmingham Overseas Student Teaching Evaluation 
Questionnaire (BOSTEQ) . The aim is to produce a rating 
instrument specifically to be used by overseas students. 

The research is part of an MSc degree at the
University of Aston 5.

ESTABLISHING RELIABILITY, GENERALISABILITY
AND VALIDITY OF STUDENT RATINGS OF 
INSTRUCTION 

METHODS USED TO ESTABLISH RELIABILITY 
1. INTERNAL CONSISTENCY. 

This can be established using, for example, the odd-even
or split-half methods, coefficient alpha (Cronbach 1951),
or the Kuder-Richardson formulas (Kuder and Richardson
1937).

The aim is to correlate various questions within the
instrument. Studies have shown good internal consistency.
For example, the Purdue Rating Scale for Instructors
(PRSI) of Remmers and Weisbrodt (1965) was shown to have
correlations of between 0.67 and 0.91 using the Horst
method (Horst 1949).

Costin et al (1971) quote correlations ranging from
.77 to .94 for randomly paired students within a class.
Feldman (1977) reports an extension of this approach,
where two mean scores for a particular item are obtained
by randomly dividing a class in half. The resulting
correlation is corrected by the Spearman-Brown Prophecy
Formula, and it produces correlations in the .70s to
.90s (see Guilford 1954 for more details). Most of the
commonly used instruments report reliability coefficients
over 0.50. Table 1 shows a selection of post 1975 studies
and the reliability coefficients reported.

But "simply computing the internal consistency of an 
entire questionnaire would be inappropriate unless the 
whole instrument were intended to measure a single 
quality and produce a single summary score across items" 
(Doyle 1975 p35) . 
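
As a rough illustration of the internal consistency measures described above, the following Python sketch computes coefficient alpha for a small matrix of item ratings, and an odd-even split-half correlation corrected with the Spearman-Brown formula. The ratings and the function names are invented for illustration and are not taken from any of the studies cited.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def coefficient_alpha(ratings):
    """Cronbach's alpha. Each row holds one student's score on every item."""
    k = len(ratings[0])
    item_vars = [pvariance([row[i] for row in ratings]) for i in range(k)]
    total_var = pvariance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def odd_even_reliability(ratings):
    """Correlate odd-item and even-item totals, then apply the
    Spearman-Brown correction for a test of double length."""
    odd = [sum(row[0::2]) for row in ratings]
    even = [sum(row[1::2]) for row in ratings]
    r = pearson(odd, even)
    return 2 * r / (1 + r)


# Hypothetical data: 5 students rating one lecturer on 4 items (1-5 scale).
ratings = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
]
alpha = coefficient_alpha(ratings)
split_half = odd_even_reliability(ratings)
print(f"alpha = {alpha:.2f}, corrected split-half = {split_half:.2f}")
```

Both figures treat the four items as measuring a single quality, which is exactly the assumption that Doyle (1975) warns about for multi-dimensional instruments.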



2. TEST-RETEST. 

Here the rating instrument is given to the same
subjects at two different times. The aim is to correlate
the two scores of each subject. For example, Remmers and
Brandenburg (1927) administered the PRSI 3 days after the
original use with the same group, and found correlations
of between 0.42 and 0.92.

But the instructor may change between
administrations of the instrument, and so a small
correlation will suggest that the instrument is unstable.
This method is also criticised for "being a test of the
student's memory instead of being a measure of
reliability" (Frey 1978 p85).

Name of Study/Sample        Instrument    Method of establishing    Reliability coefficient
                                          reliability

Frey (1978)                 Endeavour     Inter-rater               0.61 skill
26 787 students at                                                  0.32 rapport
Northwestern University

Marsh (1982)                SEEQ          (1) intraclass            (1) 0.74-0.90
250 000 in 4 years at                     (2) coefficient alpha     (2) 0.88-0.97
University of Southern
California

Watkins and Thomas (1991)   SEEQ/         coefficient alpha         0.54-0.93 overall
111 Indian students         Endeavour                               0.88 (SEEQ median)
                            combined                                0.87 (Endeavour median)

Fernandez and Mateo (1992)  CUTEQ-R       coefficient alpha         0.97-0.98
36 589 students at
Universidad Complutense

Watkins and Akande (1992)   SEEQ/         coefficient alpha         0.68-0.93 overall
158 undergrads in Nigeria   Endeavour                               0.92 (SEEQ median)
                            combined                                0.91 (Endeavour median)

Watkins and Gerong (1992)   SEEQ/         coefficient alpha         0.85-0.97 overall
77 undergrads in            Endeavour                               0.93 (SEEQ median)
Philippines                 combined                                0.94 (Endeavour median)

Watkins and Regmi (1992)    SEEQ/         coefficient alpha         0.54-0.84 overall
297 Nepalese students       Endeavour                               0.79 (SEEQ median)
                            combined                                0.73 (Endeavour median)

Table 1 - showing the coefficient of reliability found by
selected post 1975 studies.



3. MEAN RATINGS. 

It is assumed that mean ratings of instructors 
should be different, because the instructors display 
different teaching behaviour. If the means are similar or 
identical, the ratings are seen as biased. 

Whitely, Doyle and Hopkinson (1973) used this method
in a large multi-section course, finding that the mean
ratings varied between instructors.

But the assumption that instructors do differ is
open to question. However, enough research has shown that
the distinction between "good" and "bad" lecturers can be
established (eg: Marsh 1977).

Frey (1978) used a variation of this method. He
chose a sample of the data representing instructors who
had taught three or more classes (with 10 or more
students in each) that had filled in ratings. Variance
estimates were calculated for differences among
instructors, and for differences among classes within
instructors (inter-rater agreement). A formula
recommended by Ebel (1951) was then used to show the
proportion of observed variance due to differences
between instructors.
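
The exact formula Frey used is not reproduced here, so the following Python sketch is only an assumption about the general approach: a standard one-way analysis of variance on class mean ratings, from which the share of variance attributable to instructors is estimated. The function name, the data, and the simplification of an equal number of classes per instructor are all hypothetical.

```python
from statistics import mean


def instructor_variance_proportion(class_means):
    """Estimate the proportion of variance in class mean ratings that is
    due to differences between instructors.

    class_means maps each instructor to a list of class mean ratings.
    Assumes, for simplicity, the same number of classes per instructor.
    """
    groups = list(class_means.values())
    n = len(groups)        # number of instructors
    k = len(groups[0])     # classes per instructor
    grand = mean(v for g in groups for v in g)
    # Mean square between instructors and within instructors.
    ms_between = k * sum((mean(g) - grand) ** 2 for g in groups) / (n - 1)
    ms_within = sum((v - mean(g)) ** 2 for g in groups for v in g) / (n * (k - 1))
    # Variance component for instructors (floored at zero).
    var_instructor = max((ms_between - ms_within) / k, 0.0)
    return var_instructor / (var_instructor + ms_within)


# Hypothetical class mean ratings for three instructors, three classes each.
example = {
    "A": [4.2, 4.4, 4.3],
    "B": [3.1, 3.3, 3.2],
    "C": [3.8, 3.6, 3.7],
}
proportion = instructor_variance_proportion(example)
print(f"proportion of variance due to instructors = {proportion:.2f}")
```

A value near 1 means that classes taught by the same instructor agree closely relative to the differences between instructors; a value near 0 means the ratings do not separate instructors at all.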



4. ANOVA.

Proposed by Guilford (1954) : rather than attempting 
to remove potential bias, it aims to identify the 
contribution of bias to the final rating, and adjust for 
it. Obviously, this has advantages because some potential 
biases cannot be easily separated (eg: the halo effect) . 

For example, Treffinger and Feldhusen (1970), using 
this method, found that the halo effect only accounted 
for 10% of the variance in students' ratings (quoted in 
Doyle 1975 p43) . 



5. INTER-RATER RELIABILITY.

This looks at the consistency of ratings among
people. Reliability here is when all raters in a group
give the same pattern of responses. It is usually
estimated by intra-class correlation coefficients, ie:
the comparison of ratings within one class of one
lecturer with ratings of different instructors. Because
the coefficient is sensitive to the number of raters,
Centra (1979) suggests intra-class correlations in the
.70s for 10 raters through to the .90s for 20 (p27).
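
The dependence of reliability on the number of raters can be sketched with the general Spearman-Brown formula, on the assumption that it applies to the average of several students' ratings. The single-rater value of 0.20 below is hypothetical, and the sketch is not intended to reproduce Centra's figures.

```python
def reliability_of_mean(single_rater_r, n_raters):
    """Spearman-Brown estimate of the reliability of the mean rating
    given by n_raters students, from the single-rater reliability."""
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)


single = 0.20  # hypothetical reliability of one student's rating
by_class_size = {
    n: round(reliability_of_mean(single, n), 2) for n in (5, 10, 20, 40)
}
print(by_class_size)
```

Under this assumption the reliability of a class average rises steadily with class size, which is why coefficients from small and large classes are not directly comparable.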

Feldman (1977) makes a number of points about 
interpreting the reliability coefficients: 

i) "reliability coefficients of individual ratings
indicate the degree of general or relative consistency
among raters; they do not measure exact or absolute
agreement" (p229);

ii) inter-rater agreement is only the degree to
which independent raters give the same rating for the
same lecturer;

iii) inter-rater reliability is "the degree to which
the ratings by different raters are proportional when
expressed as deviations from their means" (p229);

iv) the reliability coefficients of average college 
student ratings may be high, but this does not mean that 
individual students within the classes are highly 
consistent in their ratings; 

v) consistency in ratings among students may not be 
a good basis for estimating individual ratings or average 
ratings reliability, particularly if the aim is to 
compare ratings across situations. Guthrie (1927) 
suggests that student ratings agree at the end of the 
term because of greater exposure to the lecturer, or 
student gossip. 



GENERALIZABILITY OF STUDENT RATINGS 

This is the question of whether ratings of lecturers 
can be compared across situations. In a detailed analysis 
Bausell et al (1975) compared teaching behaviours in five 
situations : 

i) Same course taught by same instructor on two 
separate occasions (CS-IS) . 

ii) Same course taught by two different instructors 
(CS-ID) . 

iii) Different courses taught by same instructor 
(CD-IS) . 

iv) Different courses taught by different instructors
within the same department (CD-ID).

v) Different courses from different departments
taught by different instructors (CD'-ID').

It was found that generalisation was possible with 
all teaching behaviour in CS-IS; with some behaviour in 
CS-ID and CD-IS, but not CD-ID and CD'-ID', as expected. 
The authors conclude that student ratings do replicate as 
a whole across time, even if individual items are 
unclear . 

Table 2 shows a summary of the correlations found by 
selected studies. 

But Smith and Cranton (1992) suggest care. They talk
about the "normative assumptions" of a class, which would
restrict generalisability. For example, large classes in
certain subjects may see "organisation" as very
important, while for smaller classes in other subjects,
"interaction" factors may be more important. The authors
emphasise that student ratings are "rather specific to
the instructional setting" (p762).

STUDY                   SAMPLE          CS-IS   CS-ID   CD-IS   CD-ID

Bausell et al (1975)    Unknown         .69     .33     .17     .07 (same dept)
                                                                .00 (diff dept)

Feldman (1978)          2182 students   .66     .16**   .46*

Marsh (1982b)           8277 classes    .70     .14     .52     .06 (same dept)

* quotes mean of Hogan (1973) and Seller et al (1977)
** quotes only Hogan (1973).

Table 2 - showing the mean correlation coefficient of
student ratings in different situations by selected
studies.

More recently, Marsh and Bailey (1993) have looked
at the generalisability of teaching characteristics - for
example, is a lecturer who is enthusiastic, but not
organised, judged the same in all courses? Using over one
million SEEQ forms collected during a 13 year period, the
authors conclude that "instructors appear to have
distinct profiles of strengths and weaknesses that are
highly generalisable" (p11).

Feldman (1978) takes up the issue of whether the
samples of students used in ratings are from populations
of comparable raters. It is not always possible to assume
that the samples are random, because students self-select
for courses. Thus the samples are classed as coming from
a population "like those observed". This makes it
possible to correlate the average class rating between
two classes taught the same course by the same lecturer.
The correlations are in the .60s and .70s (Feldman 1978
p201).

However, the correlations leave room for other
factors - for example, the course context. Feldman's
(1978) conclusion is that lecturers' ratings can only be
compared for classes of similar size, similar subject,
and similar "requiredness".



ESTABLISHING VALIDITY OF STUDENT RATING 

Gaski (1987) produces evidence of studies both
supporting (table 4) and not supporting (table 3) the
validity of student evaluations.

STUDY                     SAMPLE                RESULTS

Rodin and Rodin (1972)    293 students          Inverse partial correlation between
                                                objective measure of amount learned and
                                                student rating (with initial ability
                                                controlled for)

Snyder and Clair (1976)   72 students           Expected grades inversely related to
                                                evaluations; perceived obtained grades
                                                positively related

Pratt and Pratt (1976)    175 students          Very little correlation between obtained
                                                grades and student ratings; strong
                                                positive correlation between expected
                                                grades and ratings

Brown (1976)              2360 sections;        In stepwise regression, grades represent
                          30 000 student        more powerful predictor of ratings
                          ratings               (r=.353) than any other hypothesized
                                                antecedent

Powell (1977)             5 sections;           Ratings of instructor fall as grading
                          35-45 students        stringency increases; amount learned
                          per section           increases as grading stringency increases

(Based on Gaski 1987 p327)

Table 3 - showing research generally non-supportive of
validity of student evaluations.



Using the most widely accepted objective measure of
validity, the student achievement test, Feldman (1989b)
summarises the studies, and finds a correlation with each
individual characteristic of teaching (table 5). Not
surprisingly, significant correlations with the student
achievement test are found for "teacher's preparation"
and "clarity and understandableness".

Braskamp et al (1985) summarise the conclusions for 
the use of student achievement tests as an indicator of 
student learning. 

1. Different instructors teaching same course can be 
compared in terms of student performance on common exam, 
if classes similar in ability, prior knowledge and 
motivation . 

2. Pre and post-course test score differences can be 
used to obtain an index of learning. 

3. A pre-established number of students in a course
who answer correctly a specified percentage of test items
can be used as an indicator of student learning (p62) 6.

Student Evaluation of Teaching Effectiveness: Methodological Issues - Part 1

ISBN: 978-0-9540761-5-3 Kevin Brewer 2002

STUDY                   SAMPLE            RESULTS

Gessner (1978)          78 students       High correlation between student
                                          evaluation and performance

Frey (1973)             13 instructors;   Strong relationship between student
                        354 students      rating and teaching quality (defined as
                                          difference between observed final exam
                                          and score predicted by Scholastic
                                          Aptitude Test profile)

Marsh, Fleiner &        18 sections;      Student evaluation (across sections)
Thomas (1975)           720 students      positively correlated with final exam

Marsh (1977)            62 instructors;   Evaluations validated with retrospective
                        591 classes;      reports of most/least outstanding
                        1847 students

Marsh, Overall &        51 instructors;   Factor analysis indicated similar
Kesler (1979)           83 courses        student-faculty evaluations dimensions;
                                          median r=.49 across evaluative factors;
                                          higher SR for courses instructor rated as
                                          most effective

Marsh & Overall         31 sections;      Generally and moderately + relationship
(1980)                  960 students      between SR and teaching effectiveness
                                          criteria, including final exam grade (36
                                          of 60 correlations significant)

Howard & Maxwell        Two expts:        Weak + relationship between expected
(1980)                  i) 8551 courses   grades and student satisfaction; student
                        from 58 schools;  motivation and performance explained more
                        200 000 students; of variation in satisfaction
                        ii) 50 students
                        each from
                        19 classes

Marsh (1982c)           329 classes       General agreement between student and
                                          instructor ratings in MTMM analysis

Howard, Conway &        43 instructors;   Student and former student ratings
Maxwell (1985)          34 students/      reported superior in
                        classes;          convergent/discriminant validity to
                        30 former         other methods ie: self, colleagues and
                        students/         trained observer ratings
                        instructors

(Based on Gaski 1987 p327)

Table 4 - showing research generally supportive of
validity of student evaluations.

Instructional Dimension                     Number of   Weighted simple
                                            Studies     average correlation

1. Teacher's stimulation of
interest in the course and its
subject matter                              14          +.35 *

2. Teacher's enthusiasm                     10          +.26 *

3. Teacher's knowledge of subject           9           +.24

4. Teacher's intellectual
expansiveness                               2           No correlation given

5. Teacher's preparation                    18          +.44 *

6. Clarity and understandableness           24          +.47 *

7. Teacher's elocutionary skills            6           +.33 *

8. Teacher's sensitivity to,
and concern with, class level
and progress                                11          +.27

9. Clarity of course
objectives/requirements                     6           +.32

10. Nature and value of course
material                                    10          +.17

11. Nature and usefulness of
supplementary materials and
teaching aids                               4           -.10

12. Perceived outcome or
impact of instruction                       14          +.40

13. Instructor's fairness                   16          +.25

14. Personality characteristics             6           +.23

15. Nature, quality and frequency
of feedback from teacher                    13          +.22 *

16. Teacher's encouragement of
questions and discussion                    18          +.34 *

17. Intellectual challenge and
encouragement of independent
thought                                     7           +.23

18. Teacher's concern and respect
for students                                11          +.22 *

19. Teacher's availability
and helpfulness                             13          +.33 *

20. Teacher motivates students
to do their best work                       3           +.33

21. Teacher's encouragement of
self-initiated learning                     1           -.52

22. Teacher's productivity
in research                                 no cases

23. Difficulty in course
(description)                               11          +.07

24. Difficulty in course
(evaluation)                                9           +.05

25. Classroom management                    4           +.25

26. Pleasantness of classroom
atmosphere                                  3           +.23

27. Individualization of teaching           1           +.10

28. Instructor pursued and/or
met course objectives                       2           +.46 *

29. Overall rating of lecturer as
an item of multi-item indicator             1           +.36

30. Overall rating of teacher as
an item of multi-item indicator             3           +.38 *

31. Overall rating of course as
an item of multi-item indicator             no cases

(* = significant two-tailed p<0.001)

Table 5 - showing a summary of the results of studies
relating specific evaluations of teaching to student
achievement as found by Feldman (1989b).

Using construct validity requires the correlation of 
student ratings of a lecturer with other evaluations. 
Braskamp et al (1985) summarise the conclusions on 
lecturer self-evaluation, classroom observations by 
outsiders, and alumni ratings. 

- Lecturer self-evaluation:

1. Student ratings and self-evaluation generally
show good reliability in agreement on overall
ratings 7.

2. Agreement between students and self-evaluation
on dimensions of student involvement, teacher
support and instructional skill 8.

3. Self-ratings not influenced by age, sex,
tenure status, teaching load, or years of
teaching experience 9 (p71).

6 Based on Clark (1980).

7 Conclusions based on Blackburn and Clark (1975); Braskamp et al (1979); Doyle and Crichton
(1978); Marsh, Overall and Kesler (1979a).

8 Conclusions based on Braskamp et al (1980); Marsh (1980).



- Colleagues' ratings of instruction:

1. An observer may affect the teaching-learning
process 10.

2. Not reliable - no agreement with other methods
on instructional effectiveness 11.

3. The relationship between observed behaviour
and student learning is not very strong 12.

4. Colleagues' ratings not highly related to
student ratings, if class time was well spent
and instructor open to other viewpoints 13.

5. Agreement between colleagues and students on
specific instructional practices. They agree on
descriptions of activities, but not on their
judgments of instructional quality 14.

6. Colleagues are more generous than students
in their ratings 15 (p66).



- Alumni ratings:

1. Same students agree between course and 1 year
after graduation 16.

2. Alumni of 5 years and current students show
good agreement on overall teaching effectiveness 17.

3. Alumni ratings lower than current students 18
(p74).

9 Conclusions based on Doyle and Webber (1978).
10 Conclusions based on Fuller and Manning (1973).
11 Conclusions based on Centra (1975).
12 Conclusions based on Braskamp et al (1985).
13 Conclusions based on Centra (1975).
14 Conclusions based on Centra (1975).
15 Conclusions based on Braskamp et al (1985).
16 Conclusions based on Overall and Marsh (1979).
17 Conclusions based on Centra (1974).
18 Conclusions based on Overall and Marsh (1979).

USE OF MTMM

A technique being used more and more is the multi-trait
multi-method matrix (MTMM). This is basically a series
of correlations of expected and unexpected behaviour
with test scores.

Murphy and Davidshofer (1988) summarise three
points that a test will possess, as established
effectively by MTMM.



"1. Scores on the test will be consistent with 
scores obtained using other measures of the same 
construct . 

2. The test will yield scores that are not 
correlated with measures that are theoretically 
unrelated to the construct being measured. 

3. The method of measurement employed by the test
shows little evidence of bias" (p106).



In their original article, Campbell and Fiske 
proposed a series of rules to follow for evaluating 
convergent and discriminant validity. 



1. The convergent validity coefficients should be 
statistically significant and sufficiently 
different from zero to warrant further examination 
of the validity. 

2. The convergent validities should be higher than 
correlations between different traits assessed by 
different methods. 

3. The convergent validities should be higher than 
correlations between different traits assessed by 
the same method. 

4. The pattern of correlations between different
traits should be similar for each of the different
methods (quoted in Marsh and Hocevar 1983 p233).
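
As an illustration of how the second and third rules might be checked in practice, the Python sketch below sorts the correlations of a small MTMM matrix into convergent validities, different-trait/different-method values, and different-trait/same-method values. It uses a deliberately strict simplification (the lowest convergent validity must exceed the highest comparison value), which is an assumption of this sketch rather than the exact procedure of Campbell and Fiske. The traits, methods and correlations are all hypothetical.

```python
from itertools import combinations


def pair(a, b):
    """Order-free key for the correlation between two (trait, method) measures."""
    return frozenset((a, b))


def check_rules_2_and_3(corr, traits, methods):
    """Simplified check of rules 2 and 3 on a dictionary of correlations."""
    measures = [(t, m) for t in traits for m in methods]
    convergent, diff_trait_diff_method, diff_trait_same_method = [], [], []
    for a, b in combinations(measures, 2):
        same_trait, same_method = a[0] == b[0], a[1] == b[1]
        r = corr[pair(a, b)]
        if same_trait:
            convergent.append(r)              # same trait, different method
        elif same_method:
            diff_trait_same_method.append(r)  # different traits, same method
        else:
            diff_trait_diff_method.append(r)  # different traits and methods
    return {
        "rule2": min(convergent) > max(diff_trait_diff_method),
        "rule3": min(convergent) > max(diff_trait_same_method),
    }


# Hypothetical matrix: two traits, each rated by students and by the lecturer.
traits = ["enthusiasm", "organisation"]
methods = ["students", "self"]
corr = {
    pair(("enthusiasm", "students"), ("enthusiasm", "self")): 0.55,
    pair(("organisation", "students"), ("organisation", "self")): 0.50,
    pair(("enthusiasm", "students"), ("organisation", "self")): 0.15,
    pair(("enthusiasm", "self"), ("organisation", "students")): 0.10,
    pair(("enthusiasm", "students"), ("organisation", "students")): 0.40,
    pair(("enthusiasm", "self"), ("organisation", "self")): 0.35,
}
result = check_rules_2_and_3(corr, traits, methods)
print(result)
```

The first rule (statistical significance) and the fourth rule (similar patterns across methods) need sample sizes and a comparison of whole sub-matrices, so they are left out of this sketch.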



The above rules have been criticised: firstly, over
what constitutes a satisfactory result; and secondly,
over the use of correlations based on observed
variables to draw conclusions about underlying factors
(Kenny and Kashy 1992 p165).

The ANOVA approach (Kavanaugh et al 1971) or the 
factor analysis approach (Jackson 1969) have been 
suggested separately to overcome the weaknesses of the 
MTMM matrix. However, there is not universal agreement, 
especially over which technique of factor analysis to 
use. Kenny and Kashy (1992) review a number of techniques 
with the MTMM matrix: 

• equal loading model (Alwin 1974); all traits and
methods in the matrix are allowed to correlate;

• correlated uniqueness model (Kenny 1979); no method
factors created;

• fixed method model (Bock and Bargmann 1966); reducing
one of the method factors.

The authors conclude that all approaches using 
factor analysis have problems, and thus establishing 
convergent and discriminant validity is difficult. 

In an earlier paper, Marsh and Hocevar (1983) 
compared ANOVA and confirmatory factor analysis (CFA) ; 
and recommended the latter as having specific advantages 
for use in the MTMM matrix. 



MULTI-SECTION COURSES OR MTMM DESIGN FOR VALIDATION? 

Attempts have been made to establish validity by 
using large multi-section courses, where different groups 
of students are presented the same material by different 
instructors . 

Ideally the following controls should be used: 

• many sections to the course; 

• random assignment of students to the sections; 

• pre-test measures used; 

• each section taught by separate instructors; 

• the final examination graded externally; 

• common textbooks among the sections (Marsh 1984 p720).

Validity is then assessed by correlating the mean
student ratings of each section with the mean examination
performance of that section.
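
A minimal sketch of this validity coefficient, assuming it is computed as the correlation across sections between the section mean rating and the section mean examination score, is given below in Python. The five sections and all of the figures are invented for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical data: one entry per section, holding the individual
# student ratings (1-5) and the individual common-exam scores.
sections = [
    ([3, 3, 3], [58, 60, 62]),
    ([3, 4, 4], [61, 62, 63]),
    ([4, 4, 4], [68, 70, 72]),
    ([5, 4, 5], [70, 72, 74]),
    ([4, 4, 3], [64, 66, 68]),
]

# The section, not the student, is the unit of analysis.
mean_ratings = [mean(ratings) for ratings, _ in sections]
mean_exams = [mean(exams) for _, exams in sections]
validity = pearson(mean_ratings, mean_exams)
print(f"validity coefficient across sections = {validity:.2f}")
```

With only five sections the coefficient rests on five data points, which illustrates why the number of sections matters so much in this design.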

But this is not a perfect methodology: each
section is usually small; presage variables, like initial
student motivation, may influence the results; and there
is a lack of consistency in the measures of course
achievement and student ratings. In fact Marsh goes as
far as to say that this design is inherently weak (1984
p721).

Abrami, d'Appollonia and Cohen (1990) take up the
defence of this methodology by reanalysing 43
multi-section validity studies. They argue that the
inconsistencies of past studies were due to a lack of
proper analysis, which "lacked the sensitivity
necessary to identify characteristics that explain a
medium size effect on the relationship between ratings
and achievement" (p230).

Abrami et al (1990) point out that over 40 studies
have used multi-section courses for validation of student
ratings. The design is high in internal validity, allows
for some control between sections, and a common
examination reduces the influence of other factors. The
examination score is high in external validity - it is a
direct measure of effective teaching.

Marsh (1987) advocates MTMM designs, because of the 
greater control on threats to internal and external 
validity. But the problem is to show that correlations
between student ratings and, for example, instructor
self-ratings are adequate measures of instruction.



ARE STUDENT RATINGS SINGLE OR MULTI- 
DIMENSIONAL? 

Doyle (1983) originally proposed that all teaching 
behaviour could be covered by including 3 summary 
questions : 

i) how would you rate this instructor's overall 
teaching ability? 

ii) how would you rate the overall effectiveness of 
this course? 

iii) how much have you learned as a result of this 
course? (Doyle 1983 p36) . 

The issue of whether student ratings are assessing a 
single or multi-dimensional behaviour in teaching became 
a hotly debated issue, particularly with the publication 
of a series of articles in the Journal of Educational 
Psychology in 1991. Herbert Marsh is the main proponent 
of a multi-dimensional approach to student ratings, while 
Abrami disagrees. 

Marsh (1984) has no doubt that student ratings 
"should be unequivocally multi-dimensional (eg: a teacher 
may be quite well organised but lack enthusiasm)" (p709) . 
He is against a selection of items which are then 
summarised by an average. 



If a survey contains a hodgepodge of different 
items and student ratings are summarised by an 
average of these items or an overall rating, then 
there is little basis for knowing what is being 
measured (Marsh 1983 p151).



In a recent article, Cashin and Downey (1992),
reviewing the whole debate between Marsh and Abrami,
point out that a major obstacle to resolving the debate
is the "lack of any agreed on criterion measure of
instrumental effectiveness" (p564). Using the
Instructional Development and Effectiveness Assessment 
(IDEA) student rating system (Hoyt 1973), the authors are 
forced to accept student learning, with controls for 
possible bias, as the criterion of effective teaching. 



The results of this study have supported that 
single, global items - as suggested by Abrami 
(1985) - can account for a great deal of the 
variance resulting from a weighted composite of 
many multi-dimensional student rating items 
(Cashin and Downey 1992 p569) . 



RATING SCALES 

Doyle (1975) sees ratings composed of scales with a 
response mode provided (eg: agree/disagree); stems that 
pose a question; and cues or anchors using adjectives or 
phrases to define the points on a scale. In a later
book, Doyle (1983) extends the various ratings that could
be used to include graphic, adjectival and numerical
scales; BARS (Behaviourally Anchored Rating Scales);
forced choice scales; variable-item ratings; and mixed
formats.

Flood Page (1974) lists examples of early rating
forms, and the different systems they use. The most
popular is the list of desirable teacher qualities, on
each of which the students must rate their teacher. The
most commonly used rating is a scale with a numerical
score, from 1 = "poor" to 5 = "excellent".

But how many points should be on the scale? 4 or 5
is most common. Sharpness and reliability are reduced as
the number of points increases. Wherry (1952)
constructed a scale with 25 points, while Doyle (1975)
recommends avoiding extremes.

An alternative to traditional scales is a double-scale
format. For example, Gagne and Allaire (1974) asked
students to rate the instructor as they are now, and as
they would want them to be. The difference between the
two scores is used as an index of satisfaction or
dissatisfaction.
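
The double-scale idea can be sketched in a few lines of Python. The direction of the subtraction (actual minus ideal, so that negative values indicate a shortfall), the item names and the ratings are assumptions made for this illustration and are not taken from Gagne and Allaire.

```python
def discrepancy_index(actual, ideal):
    """Return the per-item gap (actual minus ideal) and the mean gap.

    Negative gaps mean the lecturer falls short of what the student wants."""
    gaps = {item: actual[item] - ideal[item] for item in actual}
    overall = sum(gaps.values()) / len(gaps)
    return gaps, overall


# Hypothetical ratings on a 1-5 scale from one student.
actual = {"clarity": 3, "enthusiasm": 5, "feedback": 2}
ideal = {"clarity": 5, "enthusiasm": 4, "feedback": 5}
gaps, overall = discrepancy_index(actual, ideal)
print(gaps, round(overall, 2))
```

Keeping the per-item gaps alongside the mean shows where the dissatisfaction lies, rather than only how large it is overall.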

A variation of this format involves a profile of the 
student's instructional needs (self ratings), and a 
description of what the course offers relative to 
satisfaction of each of those needs (Doyle 1975) . 

Braskamp et al (1985) summarise research on the
instrumentation in student evaluation.

i) Placement of items - specific items placed before
global items have a minimal effect on overall ratings.
Thus global ratings can be placed at either the beginning
or the end of the survey (Ory 1982).

ii) Number of response alternatives - 6 point
response scales yield higher item reliabilities than 5
point response scales. Thus global items should use more
than 5 point response scales (Masters 1974).

iii) Negative wording of items - overall ratings are 
not significantly affected by number of negatively worded 
items. Thus both positive and negative worded items can 
be used (Ory 1982) . 

iv) Labelling all scale points vs labelling only end 
points - labelling only end points yields slightly higher 
means. Thus the response format used should be consistent 
for all items (Frisbie and Brandenburg 1979 quoted in 
Braskamp et al 1985 p45) . 

It is also necessary to ask whether the terms
"satisfied" and "dissatisfied" should be used. Peterson
and Wilson (1992) show how the manner in which the
question is asked about satisfaction can influence the
response. They asked a question about cars; either as
"how satisfied" or "how dissatisfied". The first question
produced a 91% response of "very" or "somewhat
satisfied", and the second only 82%. "Posing a
satisfaction question in a positive form appears to lead
to greater reported satisfaction than posing it in a
negative form" (p65).

Questions asked earlier in the questionnaire 
influence subsequent answers. Peterson and Wilson (1992) 
found that "asking a general satisfaction question prior 
to a specific vehicle satisfaction question slightly 
increases the tendency for a 'very satisfied' response to 
the vehicle question" (p66) . 

Panney (1977) compared two versions of a rating form
- one biased towards a lecturer's strengths, the other
towards weaknesses. The responses to the global items at
the end of the rating form were as expected.

McClendon and O'Brien (1988) found that question 
order had an effect on questions of satisfaction with 
life. The placing of specific questions before general 
questions is important: "respondents must think about 
specific life domains in order to answer the general 
questions" (p361). For example, a general well-being
question will be affected by earlier specific questions
about marriage satisfaction.

This could have consequences for rating instruments
asking specific questions, and then finishing with a
question about overall evaluation of the teacher.

Schuman and Presser (1981) talk in detail about 
question construction on attitude surveys, including open 
versus closed questions; "don't knows" problem; middle 
position on scales; tone of wording; and order effects of 
questions . 



RATING ERRORS ON STUDENT RATINGS OF 
INSTRUCTION 

RATING ERROR 

All ratings contain an element of measurement error. 
Forced-choice scales are an attempt to reduce this. The 
rater must choose, for example, two items from a list of 
four equally desirable. Sharon (1970) found this type of 
rating did not differ across four conditions, while a 
usual scale did. 

But these scales have been criticised as difficult 
for raters, among other problems (Doyle 1975) . 

Research has tried to identify student 
characteristics that could bias ratings of teachers 
(reviews by Feldman 1977; 1978; 1979) . When correlations 
between characteristics and ratings are large, there is 
seen to be bias, and the ratings lack validity. But Marsh 
(1987) argues that validity is lost only when a "biasing" 
characteristic influences the ratings without influencing 
the instructional effectiveness criteria at the same time, 
or vice versa. 



BIAS IN STUDENT RATINGS 

Two areas of bias that have particularly concerned 
researchers are the effect of implicit theories ("halo 
effect"), and the semantic similarity of items. 



Implicit Theories 

This is the idea that if the raters notice certain 
characteristics in the teacher, then they assume the 
teacher must also have certain others. Whitely and Doyle 
(1976) feel that students' implicit theories influence 
their rating of teaching. Using latent partition analysis 
(Wiley 1967), they identified latent clusters of 
behaviour, when the unit of analysis was total-class, 
within-class or between-class ratings. But the authors 
feel that the implicit theories are based on experience 
of teaching, and so are quite accurate about which 
behaviours are associated together. 

Larson (1979) is critical: 



we still have no way of knowing whether a 
particular set of behaviour ratings reflects the 
actual behaviour of those being rated or whether 
they reflect population based normative assumptions 
about these behaviours (p210) . 



More recently, Widmeyer and Loy (1988) replicated 
Kelley's (1950) "first impressions" experiment, finding 
that subjects who were told that the lecturer had a warm 
personality rated the lecturer as a more effective teacher 
than subjects who were told the lecturer had a cold 
personality. The "warm personality" was also seen as less 
unpleasant, more sociable, less irritable, less ruthless, 
more humorous, less formal and more humane (p119). Full 
details of the results are given in Table 6. 
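
As a rough illustration (not part of Widmeyer and Loy's own analysis), the group means in Table 6 can be compared directly. The Python sketch below assumes the values as read from Table 6; the variable names are hypothetical. On the 7 point scale, 1 is the left hand (favourable) end of each item, so a positive cold-minus-warm difference means the "warm" lecturer was rated more favourably.

```python
# (item, "warm" group mean, "cold" group mean, significance), as read from Table 6.
table6 = [
    ("Knows his material - doesn't", 1.44, 1.65, ".05"),
    ("Considerate of class - self centred", 1.90, 2.24, ".01"),
    ("Intelligent - unintelligent", 1.59, 1.84, ".05"),
    ("Organised - not", 1.77, 1.73, "NS"),
    ("Expresses himself well - difficulty", 2.09, 2.10, "NS"),
    ("Interesting - boring", 2.05, 2.32, ".05"),
]

# Cold minus warm: positive values favour the "warm" lecturer,
# because lower scores sit nearer the favourable end of each item.
differences = {item: round(cold - warm, 2) for item, warm, cold, _ in table6}

# Items where the table reports a significant difference (anything but "NS").
significant = [item for item, _, _, sig in table6 if sig != "NS"]

for item, diff in differences.items():
    marker = "significant" if item in significant else "not significant"
    print(f"{item:<38} {diff:+.2f}  ({marker})")
```

The differences are small in absolute terms (about a third of a scale point at most), which is worth bearing in mind when reading the significance column.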

Marsh (1987) points out that implicit theories can 
be ruled out by establishing a factor structure similar 
to that of SRI from another method, for example 
lecturers' self-evaluations. This method suffers from 
little or no "halo effect", while colleagues' evaluations 
may suffer most. 



Item: Teaching ability                 "Warm"   "Cold"   Significance 
                                       Group    Group 

Knows his material - doesn't            1.44     1.65     .05 
Considerate of class - self centred     1.90     2.24     .01 
Intelligent - unintelligent             1.59     1.84     .05 
Organised - not                         1.77     1.73     NS 
Expresses himself well - difficulty     2.09     2.10     NS 
Interesting - boring                    2.05     2.32     .05 

7 point scale used: 1 = left hand end to 7 = right hand end. 
(Based on Widmeyer and Loy 1988 p120) 

Table 6 - showing mean ratings given to a stimulus person 
designated warm or cold. 



Semantic Similarity of Items 

This is slightly different to the implicit theories, 
in that the raters score items because they appear to be 
similar to other items. For example, lecturers who are 
"friendly towards individual students" will be assumed to 
also make "students feel welcome in seeking help/advice". 
Cadwell and Jenkins (1985) found evidence of this process 
using a hypothetical instructor profile with 28 graduate 
students. They explain the cognitive processes involved 
in responding to an SRI, which by its nature must lead to 
this bias. 



Thus, because student ratings are the product of 
cognitive processes that reconstruct rather than 
mirror instructor behaviour, these ratings, like 
all personality assessments, lead us to view 
behaviour as more organised and more consistent 
than it actually is (p392) . 



This study has been criticised at length (Marsh and 
Groves 1987), particularly because of the use of a 
"hypothetical profile". 

Again, this bias can be eliminated by construct 
validation. 



Other Bias 

1. "SIMPLISTIC BIAS HYPOTHESIS". 

This states that if an instructor gives high grades, 
demands little work, or teaches only small classes, they 
will receive a higher rating. Marsh (1987) cites his own 
earlier research, which he believes clearly refutes this 
and shows it to be a "strawman" (p310). The use of the 
Students' Evaluations of Educational Quality (SEEQ) 
instrument, with its multiple rating dimensions, reduces 
the possibility of a global item being influenced by the 
above factors. Furthermore, the results for the 
Workload/Difficulty dimension ran opposite to this 
"hypothesis". 



2. LENIENCY ERRORS. 

This is the tendency to rate generously those people 
with whom the rater is involved. Centra (1975) found that 
colleagues' ratings of teaching effectiveness (mean of 
4.47 out of 5) were one standard deviation higher than 
students' (mean of 3.98). Other studies have found 
slightly different results. Doyle (1983) feels that "some 
degree of leniency error can be expected in most 
evaluation" (p75), but it is higher for colleagues' 
evaluations. 
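
The size of the leniency gap in Centra's (1975) figures can be worked through in a few lines. The Python sketch below uses only the two means quoted above. The implied standard deviation is an inference from the statement that the gap equals one standard deviation, not a value reported in the source, and the variable names are hypothetical.

```python
# Means quoted above from Centra (1975), both on a 5 point scale.
colleague_mean = 4.47
student_mean = 3.98

# Raw gap between colleagues' and students' mean ratings.
gap = colleague_mean - student_mean

# The text describes this gap as one standard deviation, so the
# standard deviation implied by the quoted figures is the gap itself.
# (Inferred here for illustration; not reported in the source.)
gap_in_sd_units = 1.0
implied_sd = gap / gap_in_sd_units

print(f"gap between means: {gap:.2f} scale points")
print(f"implied standard deviation: {implied_sd:.2f} scale points")
```

A gap of about half a scale point looks modest on a 5 point scale, but expressed in standard deviation units it is a large difference between the two groups of raters.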



DOES FEEDBACK CHANGE THE TEACHING 
PERFORMANCE? 

It is generally felt that student feedback should 
improve teaching performance. Tuckman and Oliver (1968) 
show that high school teachers who received feedback were 
later rated higher than those teachers who did not 
receive feedback. But Miller (1971) found no significant 
difference in the end of term ratings between those 
teaching assistants who received mid-term feedback 
and those who did not. 



Remmers (1959) makes two important points: 

"1. Knowledge of student opinions and attitudes 
leads to improvement of the teacher's personality 
and educational procedures. 

2. Students are more favourable to student ratings 
than instructors, but more instructors have 
noticed improvement in their teaching as a result 
of student ratings than the studies have" 
(Quoted in Flood Page 1974 p68). 



Wilson (1986) details a scheme to improve faculty 
teaching at the University of California, Berkeley. 
Feedback led to improvement on nine characteristics for 
half of the lecturers. The technique used was 
consultation with the lecturer, drawing on student comments. 

Marsh (1987) reviews the two main types of studies 
aimed at answering the question of the effect of feedback 
on teaching. 

i) Short term feedback studies - generally it is 
felt that feedback and consultation can improve teaching. 
Cohen's (1981) meta-analysis of feedback studies found 
that instructors receiving mid-term feedback were rated 
higher than the control group on overall rating. 
L'Hommedieu et al (1990) point out the problem of the 
"John Henry effect", i.e. teachers who know they are 
being rated try to improve their teaching. 

ii) Long term feedback studies - Marsh feels that 
there are so few studies, and many problems with such 
research, that it is difficult to reach a conclusion. 

In a recent study, Marsh and Roche (1993) looked at 
the effect of feedback from students and consultation 
mid-term and end of term, on the evaluation at the end of 
the course. The ratings improved for both groups, but 
only ratings for the end of term group improved 
significantly more than the control group (no feedback) . 
The authors conclude that "SET (student evaluation of 
teaching) feedback coupled with consultation is an 
effective means to improve teaching effectiveness" 
(p217). 

Ryan et al (1980) looked at SRI from the point of 
morale of the faculty. Over 90% of the 193 academics felt 
morale had "greatly or somewhat decreased" through the 
use of student ratings of their teaching. Furthermore, 
nearly 45% reported a decrease in job satisfaction, and 
over 70% a decrease in their confidence in the 
administration. Many academics admit their behaviour has 
changed to some extent because of the ratings. But 
the authors are sceptical of the benefits, suggesting 
that the most frequently reported change was a reduction 
in coursework demands on the students. On most 
behaviours, "no change" was the most frequent response. 